Abstract

We  present  a  vision  system  for  the  3D  model-based  tracking  of  unconstrained  human 
movement.  Using  image  sequences  acquired  simultaneously  from  multiple  views,  we  recover 
the 3D body pose at each time instant without the use of markers. The pose-recovery problem
is formulated as a search problem and entails finding the pose parameters of a graphical
human model whose synthesized appearance is most similar to the actual appearance of the
real  human  in  the  multi-view  images.  The  models  used  for  this  purpose  are  acquired  from 
the  images.  We  use  a  decomposition  approach  and  a  best-first  technique  to  search  through 
the  high  dimensional  pose  parameter  space.  A  robust  variant  of  chamfer  matching  is  used 
as  a  fast  similarity  measure  between  synthesized  and  real  edge  images. 

We  present  initial  tracking  results  from  a  large  new  Humans-In-Action  (HIA)  database 
containing  more  than  2500  frames  in  each  of  four  orthogonal  views.  The  four  image  streams 
are  synchronized.  They  contain  subjects  involved  in  a  variety  of  activities,  of  various  degrees 
of  complexity,  ranging  from  simple  one-person  hand  waving  to  two-person  close  interaction 
in  the  Argentine  tango. 

1 Introduction


The ability to recognize humans and their activities by vision is a key feature in the pursuit of
designing machines capable of interacting intelligently and effortlessly in a human-inhabited
environment. Besides this long-term goal, many applications are possible in the relatively
near term, e.g. in virtual reality, “smart” surveillance systems, motion analysis in sports,
choreography of dance and ballet, sign language translation, and gesture-driven user
interfaces. In many of these applications a non-intrusive sensory method based on vision is
preferable to a method that relies on markers attached to the bodies of human subjects
(which in some cases is not even feasible).

Our approach to looking at humans and recognizing their activities has two major
components:

1.  body  pose  recovery  and  tracking 

2.  recognition  of  movement  patterns 

Several  choices  have  to  be  made  in  connection  with  body  pose  determination  and  tracking, 
which  affect  what  features  can  be  used:  the  type  of  model  used  (stick  figure,  volumetric 
model,  none),  the  dimensionality  of  the  space  in  which  tracking  takes  place  (2D  or  3D), 
the  number  of  sensors  used  (single,  stereo,  multiple),  the  sensor  modality  (visible  light, 
infrared,  range),  the  sensor  placement  (centralized  vs.  distributed)  and  mobility  (stationary 
vs. moving). We consider the case where we have multiple stationary (visible-light) cameras,
previously calibrated, and we observe one or more humans performing actions from multiple
viewpoints. The aim of the first component of our approach is to reconstruct from the
sequence of multi-view frames the (approximate) 3D body pose(s) of the human(s) at each
time instant; this serves as input to the movement recognition component. In an earlier
paper  [6]  movement  recognition  was  considered  as  a  classification  problem  and  a  Dynamic 
Time  Warping  method  was  used  to  match  a  test  sequence  with  several  reference  sequences 
representing  prototypical  activities.  The  features  used  for  matching  were  various  3D  joint 
angles  of  the  human  body.  In  this  paper,  we  focus  on  the  pose  recovery  and  tracking 
component  of  our  system. 

The outline of this paper is as follows. Section 2 provides a motivation for our choice
of a 3D recovery approach rather than a 2D approach. In Section 3 we discuss 3D human
modeling issues and the (semi-automatic) model acquisition procedure used by our system.
Section 4 deals with the pose recovery and tracking component. Included is a bootstrapping
procedure to start the tracking or to re-initialize it if it fails. Section 5 presents new
experimental results in which successful tracking of unconstrained whole-body movement is
demonstrated on two subjects. These are initial results¹ derived from a large Humans-In-Action (HIA)
database containing two subjects involved in a variety of activities, of various degrees of
complexity. We discuss our results and possible improvements in Section 6. Finally, Section 7
contains our conclusions.

2  2D  vs.  3D 

One  may  question  whether  it  is  desirable  or  feasible  to  try  to  recover  3D  body  pose  from  2D 
image  sequences  for  the  purpose  of  recognizing  human  movement.  An  alternative  approach  is 
to  work  directly  with  2D  features  derived  from  the  images.  Model-free  2D  features  are  usually 
obtained  by  applying  a  motion-detection  algorithm  to  the  image  (assuming  a  stationary 
camera) and obtaining the outline of a moving object, presumably human. Frequently, a
K × N spatial grid is superimposed on the motion region, after a possible normalization of its
extent. In each of the K × N tiles a simple feature is computed, and these are combined to
form a K × N feature vector to describe the state of movement at time t. This is the approach
taken by Polana and Nelson [23] and Darrell and Pentland [4]. Another possibility is to use
2D model-based features, where the assumption is that as a result of 2D segmentation and
tracking a sequence of 2D stick figure poses is available. For example, Goddard [8] uses
the 2D angular velocities and orientations of the links as features. Guo et al. [10] use a
combination of link orientations and joint positions of the stick figure.

Recognition systems using 2D model-free features have had early successes in matching
human movement patterns. For constrained types of human movement (such as walking
parallel to the image plane, involving periodic motion), many of these features have been
successfully used for classification, as in [23]. This may indeed be the easiest and best
solution for several applications. But we find it unlikely that reliable recognition of more
unconstrained and complex human movements (e.g. humans wandering around, making
gestures while walking and turning) can be achieved using these types of features exclusively.
With respect to using 2D model-based features, we note that few systems actually derive
the features they use for movement matching. Self-occlusion makes the 2D tracking problem
hard for arbitrary movements and thus existing systems assume some a priori knowledge of
the type of movement and/or the viewpoint under which it is observed [1, 19]. 2D labeling
and tracking under more general conditions is attempted by [16].

¹ The tracking results described in this paper are also available as video clips from our home pages.

We  therefore  investigate  in  this  paper  the  more  general-purpose  approach  of  recovering  3D 
pose  through  time,  in  terms  of  3D  joint  angles  defined  with  respect  to  a  human-centered  [17] 
coordinate  system.  3D  motion  recovery  from  2D  images  is  often  an  ill-posed  problem.  In  the 
case  of  3D  pose  tracking,  however,  we  can  take  advantage  of  the  available  a  priori  knowledge 
about  the  kinematic  and  shape  properties  of  the  human  body  to  make  the  problem  tractable. 
Tracking  also  is  well  supported  by  the  use  of  a  3D  human  model  which  can  predict  events 
such  as  (self)  occlusion  and  (self)  collision.  Once  3D  tracking  is  successfully  completed,  we 
have  the  benefit  of  being  able  to  use  the  3D  joint  angles  as  features  for  movement  matching, 
which  are  viewpoint  independent  and  directly  linked  to  the  body  pose.  Compared  with  3D 
joint  coordinates,  they  are  less  sensitive  to  variations  in  the  size  of  the  human. 

The  techniques  described  in  this  paper  lead  to  tracking  on  a  fine  scale,  with  the  obtained 
joint  angles  being  within  a  few  degrees  of  their  true  values.  Besides  providing  meaningful 
generic  features  for  a  movement  matching  component,  such  techniques  are  of  independent 
interest  for  their  use  in  virtual  reality  applications.  In  other  applications,  such  as  surveillance, 
continuous  fine-scale  3D  tracking  will  not  always  be  necessary,  and  can  be  combined  with 
tracking  on  a  more  coarse  level  (for  example,  considering  the  human  body  as  a  single  unit), 
changing  the  mode  of  operation  from  one  to  another  depending  on  context.  For  related  work 
by  Intille  and  Bobick  see  [13]. 

3 3D body modeling and model acquisition


3D graphical models for the human body generally consist of two components: a
representation for the skeletal structure (the “stick figure”) and a representation for the flesh
surrounding it. The stick figure is simply a collection of segments and joint angles with
various degrees of freedom at the articulation sites. The representation for the flesh can either
be surface-based (using polygons, for example) or volumetric (using cylinders, for example).
There  is  a  trade-off  between  the  accuracy  of  representation  and  the  number  of  parameters 
used  in  the  model.  Many  highly  accurate  surface  models  have  been  used  in  the  field  of 
graphics  [2]  to  model  the  human  body,  often  using  thousands  of  polygons  obtained  from 
actual  body  scans.  In  vision,  where  the  inverse  problem  of  recovering  the  3D  model  from  the 
images  is  much  harder  and  less  accurate,  the  use  of  volumetric  primitives  has  been  preferred 
to  “flesh  out”  the  segments  because  of  the  lower  number  of  model  parameters  involved. 


For  our  purposes  of  tracking  3D  whole-body  motion,  we  currently  use  a  22-DOF  model 
(3  DOF  for  the  positioning  of  the  root  of  the  articulated  structure,  3  DOF  for  the  torso 
and  4  DOF  for  each  arm  and  each  leg),  without  modeling  the  palm  of  the  hand  or  the 
foot,  and  using  a  rigid  head-torso  approximation.  See  [2]  for  more  sophisticated  methods  of 
modeling.  Regarding  shape,  we  felt  that  simple  cylindrical  primitives  (possibly  with  elliptic 
XY-cross-sections)  [5,  11,  25]  would  not  represent  body  parts  such  as  the  head  and  torso 
accurately enough. Therefore, we employ the class of tapered super-quadrics [18]; these
include such diverse shapes as cylinders, spheres, ellipsoids and hyper-rectangles. Their
parametric equation $e = (e_1, e_2, e_3)$ is given by [18]

$$
e = a \begin{pmatrix} a_1 \, C_u^{\epsilon_1} C_v^{\epsilon_2} \\ a_2 \, C_u^{\epsilon_1} S_v^{\epsilon_2} \\ a_3 \, S_u^{\epsilon_1} \end{pmatrix} \qquad (1)
$$

where $-\pi/2 \le u \le \pi/2$, $-\pi \le v \le \pi$, and where $S_\theta^{\epsilon} = \mathrm{sign}(\sin\theta)\,|\sin\theta|^{\epsilon}$ and
$C_\theta^{\epsilon} = \mathrm{sign}(\cos\theta)\,|\cos\theta|^{\epsilon}$. In (1), $a \ge 0$ is a scale parameter, $a_1, a_2, a_3 \ge 0$ are aspect ratio
parameters, and $\epsilon_1, \epsilon_2$ are “squareness” parameters. Adding linear tapering along the z-axis to the
super-quadric leads to the parametric equation $s = (s_1, s_2, s_3)$ [18]:

$$
s = \begin{pmatrix} \left( \dfrac{t_1 e_3}{a \, a_3} + 1 \right) e_1 \\ \left( \dfrac{t_2 e_3}{a \, a_3} + 1 \right) e_2 \\ e_3 \end{pmatrix} \qquad (2)
$$

where $-1 \le t_1, t_2 \le 1$ are the taper parameters along the x and y axes. So far, we have
obtained satisfactory modeling results with these primitives alone (see experiments); a more
general approach also allows deformations of the shape primitives [18, 21].
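
To make the role of each shape parameter concrete, the following Python sketch evaluates one surface point of a tapered super-quadric. It assumes the standard form of [18]: $e = a\,(a_1 C_u^{\epsilon_1} C_v^{\epsilon_2},\ a_2 C_u^{\epsilon_1} S_v^{\epsilon_2},\ a_3 S_u^{\epsilon_1})$, followed by the linear taper $s_1 = (t_1 e_3 / (a a_3) + 1)\, e_1$, $s_2 = (t_2 e_3 / (a a_3) + 1)\, e_2$, $s_3 = e_3$. The function names are illustrative only and are not taken from the system described here.

```python
import math

def signed_pow(x, eps):
    """Return sign(x) * |x|**eps, the signed power used for S and C."""
    return math.copysign(abs(x) ** eps, x)

def superquadric_point(u, v, a, a1, a2, a3, eps1, eps2):
    """Point e = (e1, e2, e3) of an untapered super-quadric."""
    cu, su = math.cos(u), math.sin(u)
    cv, sv = math.cos(v), math.sin(v)
    e1 = a * a1 * signed_pow(cu, eps1) * signed_pow(cv, eps2)
    e2 = a * a2 * signed_pow(cu, eps1) * signed_pow(sv, eps2)
    e3 = a * a3 * signed_pow(su, eps1)
    return (e1, e2, e3)

def tapered_point(u, v, a, a1, a2, a3, eps1, eps2, t1, t2):
    """Point s = (s1, s2, s3) after linear tapering along the z-axis."""
    e1, e2, e3 = superquadric_point(u, v, a, a1, a2, a3, eps1, eps2)
    s1 = (t1 * e3 / (a * a3) + 1.0) * e1
    s2 = (t2 * e3 / (a * a3) + 1.0) * e2
    return (s1, s2, e3)
```

With all aspect ratios and both squareness parameters equal to 1 the primitive is a sphere of radius $a$; squareness values close to 0 push it towards a box, and non-zero tapers narrow or widen the cross-section along z.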


In this work, we derive shape parameters $S_k = (a, a_1, a_2, a_3, \epsilon_1, \epsilon_2, t_1, t_2)$ from the
projections of occluding contours in two orthogonal views, parallel to the zx- and zy-planes.
This involves the human subject facing the camera frontally and sideways. We assume 2D
segmentation  of  the  two  orthogonal  views;  a  way  to  obtain  such  a  segmentation  is  proposed 
in  recent  work  by  Kakadiaris  and  Metaxas  [15].  Back-projecting  the  2D  projected  contours 
of  a  quadric  gives  the  3D  occluding  contours,  after  which  a  coarse-to-fine  search  procedure 
is  used  over  a  reasonable  range  of  parameter  space  to  determine  the  best-fitting  quadric. 
Fitting  uses  chamfer  matching  (see  the  next  section)  as  a  similarity  measure  between  the 
fitted  and  back-projected  occluding  3D  contours.  Figure  1  shows  frontal  and  side  views  of 
the  recovered  torso  and  head  for  two  persons:  DARIU  and  ELLEN.  Figure  2  shows  their 
complete  recovered  models  in  a  graphics  rendering.  These  models  are  used  in  the  tracking 
experiments  of  Section  5. 


4  Pose  recovery  and  tracking 

The general framework for our tracking component is adapted from the early work by O'Rourke
and Badler [26] and is illustrated in Figure 3a. Four main components are involved:
prediction, synthesis, image analysis and state estimation. The prediction component takes into
account previous states up to time t to make a prediction for time t + 1. It is deemed more
stable  to  do  the  prediction  at  a  high  level  (in  state  space)  than  at  a  low  level  (in  image 
space),  allowing  an  easier  way  to  incorporate  semantic  knowledge  into  the  tracking  process. 
The  synthesis  component  translates  the  prediction  from  the  state  level  to  the  measurement 
(image)  level,  which  allows  the  image  analysis  component  to  selectively  focus  on  a  subset  of 

Figure 3: (a) Tracking cycle; (b) pose-search cycle.

the  first  subsection  we  cover  the  pose  estimation  component;  the  second  subsection  briefly 
covers  the  other  components. 

4.1  Pose  estimation 

One  approach  to  pose  recovery  is  to  derive  point  matches  between  a  3D  figure  and  its  2D 
projection  to  solve  for  the  former,  perhaps  using  several  images.  The  advantage  of  this  is 
that  rigorous  mathematical  analysis  can  be  applied  to  solve  for  the  3D  pose;  the  problem 
can  be  solved  using  techniques  borrowed  from  inverse  kinematics  (see  the  precursor  to  [24]), 
constrained  optimization  [29],  or  algebraic  geometry  [12].  On  the  downside,  this  approach 
requires  feature  points  (usually  the  joints)  to  be  accurately  located  in  the  images,  which  is 
quite  difficult.  Moreover,  the  approach  seems  to  be  very  sensitive  to  occlusion. 

We  therefore  pursued  an  alternative  approach  to  pose  recovery,  based  on  a  generate-and- 
test  strategy.  Here,  the  pose  recovery  problem  is  formulated  as  a  search  problem  and  entails 
finding  the  pose  parameters  of  a  graphical  human  model  whose  synthesized  appearance  is 
most  similar  to  the  actual  appearance  of  the  real  human  (see  Figure  3b).  This  approach 
has  the  advantage  that  the  measure  of  similarity  between  synthesized  appearance  and  actual 
appearance  can  now  be  based  on  whole  contours  and/or  regions  rather  than  on  a  few  points. 
So far, existing systems which work on real images using this strategy have had limitations.
Perales and Torres [22] describe a system which involves input from a human operator.
Hogg [11] and Rohr [25] deal with the restricted movement of walking parallel to the image
plane, for which the search space is essentially one-dimensional. Downton and Drouet [5]
attempt to track unconstrained upper-body motion, but conclude that the tracking fails
due to propagation of errors. Recent work by Goncalves et al. [9] uses a Kalman-filtering
approach to track arm movements from single-view images where the shoulder remains fixed.
Finally, work by Rehg [24] is geared towards finger tracking. We aim to improve the previous
approaches, where applicable, along the following lines.

Similarity  measure 

In our approach the similarity measure between model view and actual scene is based on
arbitrary edge contours rather than on straight line approximations (as in [25], for example);
we use a robust variant of chamfer matching [3]. The directed chamfer distance $DD(T, R)$
between a test point set $T$ and a reference point set $R$ is obtained by summing the distances
from each point in $T$ to its nearest point in $R$:

$$
DD(T, R) = \sum_{t \in T} \min_{r \in R} \| t - r \| \qquad (3)
$$

Its normalized version is

$$
\overline{DD}(T, R) = DD(T, R) / |T| \qquad (4)
$$

$DD(T, R)$ can be efficiently obtained in a two-pass process by pre-computing the chamfer
distance on a grid to the reference set. The resulting distance map is the so-called “chamfer
image” (see Figures 4b and 4c). It would be efficient if we could use only $DD(M, S)$ during
pose search (as done in [3]), where $M$ and $S$ are the projected model edges and scene edges,
respectively. In that case, the scene chamfer image would have to be computed only once,
followed by fast access for different model projections. However, using this measure alone
has the disadvantage (which becomes apparent in experiments) that it does not contain
information about how close the reference set is to the test set. For example, a single point
can be really close to a large straight line, but we may not want to consider the two entities
very similar. We therefore use the undirected normalized chamfer distance

$$
D(T, R) = \left( \overline{DD}(T, R) + \overline{DD}(R, T) \right) / 2 \qquad (5)
$$


Figure 4: (a) Scene edge image (after preprocessing); (b) filtered edge image (model
prediction in grey, accepted edges in black); (c) chamfer image.

A further modification is to perform outlier rejection on the distribution of the point distances
$dd(t, R) = \min_{r \in R} \| t - r \|$. Points $t$ for which $dd(t, R)$ exceeds a threshold $\theta$ are rejected
outright; the mean $\mu$ and standard deviation $\sigma$ of the resulting distribution are then used to
reject points $t$ for which $dd(t, R) > \mu + 2\sigma$.
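
The robust measure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in our system: nearest distances are found by brute force rather than by lookup in a pre-computed chamfer image, the population standard deviation is used, and returning `theta` when every point is rejected is a hypothetical fallback. All function names are illustrative.

```python
import math
from statistics import mean, pstdev

def point_distances(T, R):
    """dd(t, R) for every t in T: distance to the nearest point of R."""
    return [min(math.dist(t, r) for r in R) for t in T]

def robust_directed(T, R, theta):
    """Normalized directed chamfer distance with two-stage outlier rejection."""
    # Stage 1: reject points farther than the threshold theta outright.
    d = [x for x in point_distances(T, R) if x <= theta]
    if not d:
        return theta  # assumption: fallback when nothing survives stage 1
    # Stage 2: reject points beyond mean + 2 standard deviations.
    mu, sigma = mean(d), pstdev(d)
    kept = [x for x in d if x <= mu + 2.0 * sigma]
    return mean(kept)

def undirected(T, R, theta):
    """Average of the two robust directed distances, as in Eq. (5)."""
    return 0.5 * (robust_directed(T, R, theta) + robust_directed(R, T, theta))
```

For a single point lying one pixel away from a long line, the directed distance from the point to the line is 1, whereas the distance from the line to the point is much larger; the undirected measure therefore penalizes the mismatch, which is the motivation given above.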

Other measures which work directly on the scene image could be (and have been) used to
evaluate a hypothesized model pose: correlation (see [24] and [9]) and average contrast value
along the model edges (a measure commonly used in the snake literature). The reason we
opted for preprocessing the scene image (i.e. applying an edge detector) and chamfer
matching is that it provides a gradual measure of similarity between two contours while having a
long-range effect in image space. It is gradual since it is based on distance contributions of
many points along both model and scene contours; as two identical contours are moved
apart in image space the average closest distance between points increases gradually. This
effect is noticeable over a range up to a threshold in the absence of noise. The two factors,
graduality and long-range effect, make (chamfer) distance mapping a suitable evaluation
measure to guide a search process. Correlation and average contrast along a contour, on the
other hand, typically provide strong peak responses but rapidly declining off-peak responses.

Multi-view approach

By using a multi-view approach we achieve tighter 3D pose recovery and tracking of the
human body than by using one view only; body poses and movements that are ambiguous from
one view can be disambiguated from another view. We synthesize appearances of the human
model for all the available views, and evaluate the appropriateness of a 3D pose based on the
similarity measures for the individual views (see Figure 3b). Currently, the contributions
from the different views are weighted inversely proportionally to the distance between the
human torso center and the camera plane (this uses some simplifying assumptions, among
them orthogonal projection). We plan to include a weighting scheme which reasons locally
(per body unit) about the reliability of the observations.
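
The current weighting of the views can be illustrated as follows. The sketch assumes that the weights are normalized to sum to one, which the text above does not specify; the function and argument names are hypothetical.

```python
def combine_views(view_scores, torso_to_camera_distances):
    """Weighted average of per-view dissimilarities.

    Each view is weighted by 1 / distance between the torso center and the
    camera plane; normalizing the weights is an assumption of this sketch.
    """
    if len(view_scores) != len(torso_to_camera_distances):
        raise ValueError("one distance is needed per view")
    if any(d <= 0 for d in torso_to_camera_distances):
        raise ValueError("distances must be positive")
    weights = [1.0 / d for d in torso_to_camera_distances]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, view_scores)) / total
```

With equal distances this reduces to the plain mean of the per-view scores; a camera that is closer to the subject contributes proportionally more.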

Search 

Search techniques are used to prune the high-dimensional pose parameter space (see also
[20]). We currently use best-first search; we do this because a reasonable initial state can be
provided by a prediction component during tracking or by a bootstrapping method at
start-up. The use of a well-behaved similarity measure derived from multiple views, as discussed
before, is likely to lead to a search landscape with fairly wide and pronounced maxima
around the correct parameter values; this can be well detected by a local search technique
such as best-first. Nevertheless, the fact remains that the search space is very large and
high-dimensional (22 dimensions per human, in our case); this makes “straight-on” search
daunting. The proposed solution to this is search space decomposition. Define the original
N-dimensional search space $S$ at time $t$ as

$$
S = \{P_1\} \times \cdots \times \{P_N\}, \quad P_i = \hat{p}_i - \Delta_{1i}, \ldots, \hat{p}_i + \Delta_{2i}, \ \text{step } \Delta_{3i}, \qquad (6)
$$

where $\hat{P} = (\hat{p}_1, \ldots, \hat{p}_N)$ is the state prediction for time $t$. We define the decomposed search
space $S^*$ as

$$
S^* = (S_1, S_2) \qquad (7)
$$

$$
S_1 = \{P_{i_1}\} \times \cdots \times \{P_{i_M}\} \times \{\hat{p}_{i_{M+1}}\} \times \cdots \times \{\hat{p}_{i_N}\} \qquad (8)
$$

$$
S_2 = \{\bar{p}_{i_1}\} \times \cdots \times \{\bar{p}_{i_M}\} \times \{P_{i_{M+1}}\} \times \cdots \times \{P_{i_N}\} \qquad (9)
$$

where $(\bar{p}_{i_1}, \ldots, \bar{p}_{i_M})$ is derived from the best solution to searching $S_1$. The above search
space decomposition can be applied recursively and can be represented by a tree in which
non-leaf nodes represent search spaces to be further decomposed and leaf nodes are search
spaces to be actually processed. The recursive scheme we propose for the pose recovery of
K humans is illustrated in Figure 5. In order to search for the pose of the i-th human in the
scene we synthesize humans 1, ..., i − 1 with the best pose parameters found so far, and
synthesize humans i + 1, ..., K with their predicted pose parameters. Next we search for the
best torso/head configuration of the i-th human while keeping the limbs at their predicted
values, etc.
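
The decomposition can be illustrated with a small Python sketch. It searches one block of parameters at a time while the remaining parameters stay at their predicted values (for blocks not yet searched) or at the best values found so far (for blocks already searched). For brevity each block is enumerated exhaustively, which is a simplification: our system uses best-first search within each subspace. The cost function, block layout and function names are all hypothetical.

```python
from itertools import product

def grid(center, lo, hi, step):
    """Values center - lo, ..., center + hi in increments of step."""
    n = int(round((lo + hi) / step))
    return [center - lo + k * step for k in range(n + 1)]

def search_block(cost, pose, block, lo, hi, step):
    """Search the parameters indexed by `block`; all others stay fixed."""
    best_pose, best_cost = list(pose), cost(pose)
    axes = [grid(pose[i], lo, hi, step) for i in block]
    for values in product(*axes):
        candidate = list(pose)
        for i, value in zip(block, values):
            candidate[i] = value
        c = cost(candidate)
        if c < best_cost:
            best_pose, best_cost = candidate, c
    return best_pose, best_cost

def decomposed_search(cost, predicted, blocks, lo, hi, step):
    """Search the blocks one after another, starting from the prediction."""
    pose = list(predicted)
    best_cost = cost(pose)
    for block in blocks:
        pose, best_cost = search_block(cost, pose, block, lo, hi, step)
    return pose, best_cost
```

Two blocks of two parameters with seven grid values each require 49 + 49 evaluations instead of the 2401 needed for the joint grid. The price is that the result is only guaranteed to be optimal when the cost is separable across blocks; otherwise it is an approximation.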


Figure  5:  A  decomposition  of  the  pose-search  space. 

We  have  found  in  practice  that  it  is  more  stable  to  include  the  torso-twist  parameter  in 
the  arm  (or  leg)  search  space,  instead  of  in  the  torso/head  search  space.  This  is  because 
the  observed  contours  of  the  torso  alone  are  not  very  sensitive  to  twist.  Given  that  we  keep 
the  root  of  the  articulated  figure  fixed  at  the  torso  center,  the  dimensionalities  of  the  search 
spaces  we  actually  search  are  5,  9,  and  8,  respectively. 

Initialization 

Our bootstrapping procedure for starting the tracking currently handles the case where
the moving objects (i.e. humans) do not overlap and are positioned against a stationary
background. The procedure starts with background subtraction, followed by a thresholding
operation to determine the region of interest; see Figure 6. This operation can be quite
noisy, as shown in the figure. The aim is to determine from this binary image the major axis
of the region of interest; in practice this is the axis of the prevalent torso-head configuration.
Together with the major axis of another view, this allows the determination of the major 3D
axis of the torso. Additional constraints regarding the position of the head along the axis
(currently implemented as a simple histogram technique) allow a fairly precise estimation
of all torso parameters, with the exception of the torso twist, which is searched for, together
with the arm/leg parameters, in a coarse-to-fine fashion.


Figure  6:  Robust  major  axis  estimation  using  iterative  PCA  (cameras  FRONT  and  RIGHT). 
Successive  approximations  to  the  major  axis  are  shown  in  lighter  colors. 

The  determination  of  the  major  axis  can  be  achieved  robustly  by  iteratively  applying  a 
principal  component  analysis  (PCA)  [14]  on  data  points  sampled  from  the  region  of  interest. 
At  each  iteration  the  “best”  major  axis  is  computed  using  PCA  and  the  distribution  of 
the  distances  from  the  data  points  to  this  axis  is  computed.  Data  points  whose  distances 
to  the  current  major  axis  are  more  than  the  mean  plus  twice  the  standard  deviation  are 
considered  outliers  and  removed  from  the  data  set.  This  process  results  in  the  removal  of 
the  data  points  corresponding  to  the  hands  if  they  are  located  lateral  to  the  torso,  and  also 
of other types of noise. The iterations are halted if the parameters of the major axis vary
by less than a user-defined fraction from one iteration to another. In Figure 6 the successive
approximations to the major axis are shown by straight lines in increasingly light colors.
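
The iterative procedure can be sketched for 2D points as follows. The sketch uses the closed-form principal direction of a 2×2 covariance matrix, and it stops when the axis orientation changes by less than `tol` or when no point is removed; this particular stopping rule, the default values and the function names are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def major_axis(points):
    """Centroid and orientation (radians) of the first principal axis."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = mean((p[0] - mx) ** 2 for p in points)
    syy = mean((p[1] - my) ** 2 for p in points)
    sxy = mean((p[0] - mx) * (p[1] - my) for p in points)
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), angle

def robust_major_axis(points, tol=1e-3, max_iter=20):
    """Repeat PCA, discarding points farther than mean + 2 std from the axis."""
    pts = list(points)
    centroid, angle = major_axis(pts)
    for _ in range(max_iter):
        dx, dy = math.cos(angle), math.sin(angle)
        # Perpendicular distance of each point to the current axis.
        dist = [abs((x - centroid[0]) * dy - (y - centroid[1]) * dx)
                for x, y in pts]
        limit = mean(dist) + 2.0 * pstdev(dist)
        kept = [p for p, d in zip(pts, dist) if d <= limit]
        if len(kept) == len(pts) or len(kept) < 2:
            break  # nothing (or almost everything) would be removed
        new_centroid, new_angle = major_axis(kept)
        change = abs(math.sin(new_angle - angle))  # insensitive to 180-degree flips
        pts, centroid, angle = kept, new_centroid, new_angle
        if change < tol:
            break
    return centroid, angle
```

On a synthetic upright torso with one arm stretched sideways, the first PCA axis is tilted towards the arm; after a few iterations the arm points are discarded and the axis returns to vertical.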

4.2  The  other  components 

Our prediction component works in batch mode and uses a constant acceleration model
for the pose parameters. In other words, a second-degree polynomial is fitted to each pose
parameter at times t, ..., t − r + 1, and its extrapolated value at time t + 1 is used for
prediction. The synthesis
component uses a standard graphics renderer to give the model projections for the various
camera  views.  Finally,  the  image  analysis  component  applies  an  edge  detector  to  the  real 
images,  performs  linking,  and  groups  the  edges  into  constant-curvature  segments.  These 
segments  are  each  considered  as  a  unit  and  either  accepted  into  or  rejected  from  the  filtered 
scene  edge  map,  a  decision  which  is  based  on  their  directed  chamfer  distances  to  the  projected 
model edges; see Figure 4. This process facilitates the removal of unwanted contours which
could disturb the scene chamfer image (in Figure 4, for example, background edges around
the head area in the original edge image are absent in the filtered edge image).
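
As an illustration of the constant-acceleration prediction described at the start of this subsection, the sketch below fits a second-degree polynomial to the most recent values of one pose parameter by least squares and evaluates it one time step ahead. Solving the 3×3 normal equations with Cramer's rule is a choice made here for compactness; the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def predict_next(history):
    """Extrapolate a least-squares quadratic fit one step past the last sample.

    `history` holds the parameter values at the last r >= 3 time steps,
    oldest first. Time is re-indexed so that the last sample is at k = 0.
    """
    r = len(history)
    if r < 3:
        raise ValueError("at least three samples are needed")
    ks = range(-(r - 1), 1)
    power_sums = [sum(k ** j for k in ks) for j in range(5)]
    rhs = [sum((k ** j) * x for k, x in zip(ks, history)) for j in range(3)]
    A = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    d = det3(A)
    coeffs = []
    for col in range(3):
        M = [row[:] for row in A]
        for i in range(3):
            M[i][col] = rhs[i]
        coeffs.append(det3(M) / d)
    c0, c1, c2 = coeffs
    return c0 + c1 + c2  # polynomial value at k = 1
```

A full pose prediction would apply `predict_next` independently to the history of each of the pose parameters. With exactly three samples the fit interpolates them, and the prediction reduces to 3x(t) − 3x(t−1) + x(t−2).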

5  Experiments 

We compiled a large database containing multi-view images of human subjects involved in
a variety of activities. These activities are of various degrees of complexity, ranging from
single-person hand waving to the challenging two-person close interaction of the Argentine
tango. The data was taken from four (near-)orthogonal views (FRONT, RIGHT, BACK and
LEFT) with the cameras placed wide apart in the corners of a room for maximum coverage;
see  Figure  7.  The  background  is  fairly  complex;  many  regions  contain  bar-like  structures, 
and  some  regions  are  highly  textured  (observe  the  two  VCR  racks  in  the  lower-right  image  of 
Figure  7).  The  subjects  wore  tight-fitting  clothes.  Their  sleeves  were  of  contrasting  colors, 
simplifying  the  edge  detection  somewhat  in  cases  where  one  body  part  occludes  another. 

Because of disk space and speed limitations, the more than one hour’s worth of image
data was first stored on (SVHS) video tape. A subset of this data was digitized (properly
aligned by its time code (TC)), and makes up the HIA database, which currently contains
more than 2500 frames in each of the four views.

Figure 7: Epipolar geometry of cameras FRONT (upper-left), RIGHT (upper-right), BACK
(lower-left) and LEFT (lower-right): epipolar lines are shown corresponding to the selected
points from the view of camera FRONT.

The cameras were calibrated in a two-step process, first for the intrinsic parameters
(individually) and then for the extrinsic parameters (in pairs). We used an iterative
nonlinear least-squares method to do this; it was developed by Szeliski and Kang [27], who kindly
made it available to us. Figure 7 illustrates the outcome; the epipolar lines shown in the
RIGHT,  BACK  and  LEFT  views  correspond  to  the  selected  points  in  the  FRONT  view.  One  can 
see  that  corresponding  points  lie  very  close  to  or  on  top  of  the  epipolar  lines.  Observe  how 
all  the  epipolar  lines  emanate  from  one  single  point  in  the  BACK  view:  the  FRONT  camera 
center  lies  within  its  view. 

Our  system  is  implemented  under  A.V.S.  (Advanced  Visualization  System).  Following 
its  data  flow  network  model,  it  consists  of  independently  running  modules,  receiving  and 
passing  data  through  their  interconnections.  The  implemented  A.V.S.  network  bears  a  close 
resemblance to Figure 3. The parameter space was bounded in each angular dimension by
±15 degrees, and in each xyz-dimension by ±10 cm around the predicted parameter values.
The discretization was 5 degrees and 5 cm, respectively. We kept these values constant
during tracking.
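
To give a sense of scale, the following sketch counts the grid points implied by these bounds and step sizes. It assumes, as one reading of the decomposition into search spaces of dimension 5, 9 and 8, that the torso/head space holds the three translational parameters plus two angles and that the arm and leg spaces are purely angular. The counts describe the size of the discretized spaces, not the number of poses actually evaluated, since best-first search visits only part of each space.

```python
def values_per_axis(half_range, step):
    """Number of grid values in [-half_range, +half_range] at the given step."""
    return (2 * half_range) // step + 1

angular = values_per_axis(15, 5)  # degrees: -15, -10, ..., +15
linear = values_per_axis(10, 5)   # centimetres: -10, -5, ..., +10

def grid_size(n_linear, n_angular):
    """Grid points in a space with the given numbers of dimensions."""
    return linear ** n_linear * angular ** n_angular

# Joint 22-dimensional grid (3 translational and 19 angular parameters).
full = grid_size(3, 19)

# Decomposed spaces of dimension 5, 9 and 8 (assumed split, see above).
decomposed = grid_size(3, 2) + grid_size(0, 9) + grid_size(0, 8)
```

Under these assumptions the joint grid holds about 1.4 × 10^18 points, while the three decomposed grids together hold about 4.6 × 10^7.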

Figures  8-13  illustrate  tracking  for  persons  DARIU  and  ELLEN.  The  movement  performed 
can  be  described  as  raising  the  arms  sideways  to  a  90  degree  extension,  followed  by  rotating 
both  elbows  forward.  Moderate  opposite  torso  movement  takes  place  for  balancing  as  the 
arms  are  moved  forward  and  backwards.  The  current  recovered  3D  pose  is  illustrated  by  the 
projection  of  the  model  in  the  four  views,  shown  in  white.  (The  displayed  model  projections 
include  for  visual  purposes  the  edges  at  the  intersections  of  body  parts;  these  were  not 
included  in  the  chamfer  matching  process.)  It  can  be  seen  that  tracking  is  quite  successful, 
with  a  good  fit  for  the  recovered  3D  pose  of  the  model  for  the  four  views.  Figure  14  shows 
some  of  the  recovered  pose  parameters  for  the  DARIU  sequence.  Figure  15  shows  the  result  of 
movement  recognition  using  a  variant  of  Dynamic  Time  Warping  (DTW),  described  in  [6]; 
for  the  time-interval  in  which  the  elbows  rotate  forward,  we  use  the  left  hand  pose  parameters 
derived  from  the  ELLEN  sequence  as  a  template  (see  Figure  15a)  and  match  them  with  the 
corresponding  parameters  of  the  DARIU  sequence.  Matching  with  DTW  allows  (limited) 
time-scale  variations  between  patterns.  The  result  is  given  in  Figure  15b,  where  the  DTW 
dissimilarity  measure  drops  to  a  minimum  when  the  corresponding  pose  pattern  is  detected 
in  the  DARIU  sequence. 
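
The template matching described above can be sketched with a textbook subsequence DTW. This is an assumption-laden illustration, not the DTW variant of [6]: the Euclidean frame distance, the step pattern and the toy one-dimensional data are all chosen for the example. For each frame of the longer sequence, it reports the cheapest cost of warping the whole template onto some stretch of the sequence ending at that frame, giving a dissimilarity curve of the kind shown in Figure 15b.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two pose-parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def subsequence_dtw(template, sequence):
    """Return, for each end frame j, the minimal DTW cost of aligning the
    whole template to a subsequence of `sequence` that ends at frame j."""
    inf = float("inf")
    m = len(sequence)
    prev = [0.0] * (m + 1)            # empty template: a match may start anywhere
    for i in range(1, len(template) + 1):
        cur = [inf] * (m + 1)         # non-empty template cannot match nothing
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], sequence[j - 1])
            # Diagonal step, or repeat a sequence frame, or repeat a template frame.
            cur[j] = d + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    return prev[1:]

# Toy data: the sequence contains the template pattern at half speed.
template = [[0.0], [1.0], [2.0], [1.0], [0.0]]
pattern_slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
sequence = [[v] for v in [5.0] * 3 + pattern_slow + [5.0] * 3]

costs = subsequence_dtw(template, sequence)
best_end = costs.index(min(costs))    # frame where the pattern match ends
```

Because repeated frames are allowed along the warping path, the slowed-down pattern is matched at zero cost. This is the (limited) tolerance to time-scale variations mentioned above.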

6  Discussion 

As we process more sequences from our HIA database, our aim is to handle the more
complex sequences, involving fast-varying poses, multiple bodies and close interactions. One
such  example  is  the  “Basico”  sequence,  in  which  two  persons  dance  the  basic  steps  of  the 
Argentine  tango  at  normal  speed;  see  Figure  16.  We  show  a  manual  positioning  of  the  3D 
models  of  the  dancers. 

We  consider  several  improvements  to  our  system.  On  the  image  processing  level,  we  are 
interested in a tighter coupling between prediction and segmentation. Currently, the image
processing component applies a general-purpose edge detector and uses prediction only
for filtering purposes. We are interested in more actively using the prediction information
through the use of deformable templates. On the algorithmic level, we are interested in
methods of further constraining the search space, based on either image flow or stereo
correspondence. Finally, for performance, we plan a parallel and distributed implementation of
our system, an extension which is well supported by our approach and A.V.S.

Figure 8: Tracking sequence D-TwoElbowRot: t = 0 (cameras FRONT, RIGHT, BACK and
LEFT).

7  Conclusions 

We have presented a new vision system for the 3D model-based tracking of unconstrained
human movement from multiple views. A large Humans-In-Action database has been
compiled, for which initial tracking results were shown. We can draw two conclusions from these
initial experimental results. First, our calibration and human modeling procedures
support a (perhaps surprisingly) good 3D localization of the model such that its projections
match the all-around camera views. This is good news for the feasibility of any multi-view
3D model-based tracking method, not just ours. Second, the proposed pose recovery and
tracking method based on, among others, the chamfer distance similarity measure, is indeed
able to maintain a good fit over time. This is encouraging as we turn to the more complex
sequences.

Figure 9: Tracking sequence D-TwoElbowRot: t = 10 (cameras FRONT, RIGHT, BACK and
LEFT).

Figure 10: Tracking sequence D-TwoElbowRot: t = 25 (cameras FRONT, RIGHT, BACK and
LEFT).


Figure  11:  Tracking  sequence  E-TwoElbowRot:  t  =  0  (cameras  FRONT,  LEFT,  BACK  and 

Figure 14: Recovered 3D pose parameters vs. frame number, D-TwoElbowRot; (a) and
(b): LEFT and RIGHT ARM, abduction (x), elevation (o), twist (+) and extension angle (*);
(c): TORSO, abduction (x), elevation (o), twist angle (+) and x- (dot), y- (dashdot), and
z-coordinates (solid).


Figure  15:  (a)  A  template  T  for  the  left  arm  movement,  extracted  from  E-TwoElbowRot; 
(b)  DTW  dissimilarity  measure  of  matching  template  T  with  the  LEFT  ARM  pose  parameters 
of  D-TwoElbowRot. 